
Add leadership "domains" so multiple Rivers can operate in one schema#1113

Open
brandur wants to merge 1 commit into master from brandur-leadership-realms

Conversation

@brandur
Contributor

@brandur brandur commented Dec 23, 2025

We've gotten a couple of requests so far (see #342 and #1105) to be
able to start multiple River clients targeting different queues within
the same database/schema, and to give them the capacity to operate
independently enough to be functional. This currently isn't possible
because a single leader is elected per schema, and it handles all
maintenance operations, including non-queue ones like periodic job
enqueuing.

Here, add the idea of a LeaderDomain. This lets a user set the
"domain" on which a client will elect its leader, allowing multiple
leaders to be elected in a single schema. Each leader runs its own
maintenance services.

Setting LeaderDomain has the additional effect of making maintenance
services operate only on the queues that their client is configured
for. The idea is to preserve backwards compatibility, in that the
default behavior (with LeaderDomain unset) is unchanged, while
providing a path for multiple leaders to interoperate with each other.

There are still a few edges: for example, reindexing is not
queue-specific, so multiple leaders could each be running a reindexer.
I've provided guidance in the config documentation that ideally, all
clients but one should have their reindexer disabled.

@brandur brandur requested a review from bgentry December 23, 2025 22:21
@brandur
Contributor Author

brandur commented Dec 23, 2025

@bgentry This works (I think), but still needs some testing added. Wanted to get your general reaction before taking it all the way.

@brandur
Contributor Author

brandur commented Dec 24, 2025

Okay, the tests for this should be in good shape now.

@@ -0,0 +1,3 @@
ALTER TABLE /* TEMPLATE: schema */river_leader
Contributor Author

Since we're adding a migration here, before shipping I'd also go through the existing "needs migration" issues and pull them in.

One nice thing about this migration as currently written is that if you did not run it, it wouldn't be a problem as long as you didn't try to use LeaderDomain. So it's safer than your average migration.

@brandur brandur force-pushed the brandur-leadership-realms branch 2 times, most recently from ec7e682 to f41afcc Compare December 26, 2025 19:31
Comment thread client.go Outdated
// because the default client(s) will infringe on the domains of the
// non-default one(s).
//
// Certain maintenance services that aren't queue-related like the indexer
Contributor

Suggested change
// Certain maintenance services that aren't queue-related like the indexer
// Certain maintenance services that aren't queue-related like the reindexer

Contributor Author

Fixed! Thx.

Comment thread client.go
Comment on lines +247 to +249
// In general, most River users should not need LeaderDomain, and when
// running multiple Rivers may want to consider using multiple databases and
// multiple schemas instead.
Contributor

I would probably try to get some version of this warning into the first paragraph to dissuade people from using this feature. Could describe it as "an advanced option" or something like that. I would just like to ensure people don't start reaching for this setting automatically just because it's there, since it has some major footguns.

Contributor Author

Yep, great point. Fixed.

Comment thread client.go
Comment on lines +229 to +234
// A warning though that River *does not protect against configuration
// mistakes*. If client1 on domain1 is configured for queue_a and queue_b,
// and client2 on domain2 is *also* configured for queue_a and queue_b, then
// both clients may end up running maintenance services on the same queues
// at the same time. It's the caller's responsibility to ensure that doesn't
// happen.
Contributor

Gotta say this definitely feels dangerous. I'm wondering what else can break if we allow for breaking one of the main promises of leader election (there can only be one), including Pro features.

Contributor Author

Yeah, it's admittedly not super great. I think the only alternatives, though, involve storing something in the database so clients can compare notes with one another, which is also not great for other reasons.

Comment thread client.go Outdated
Comment on lines +907 to +908
// It's important for queuesIncluded to be `nil` in case it's not in use
// for the various driver queries to work correctly.
Contributor

maybe the driver layer should handle nil and []string{} equivalently to avoid needing to deal with that concern at this level?

Contributor Author

Yeah, good call. Modified the drivers to do that and added tests to each query in question.

Comment thread internal/leadership/elector.go Outdated
var sub *notifier.Subscription
if e.notifier == nil {
e.Logger.DebugContext(ctx, e.Name+": No notifier configured; starting in poll mode", "client_id", e.config.ClientID)
e.Logger.DebugContext(ctx, e.Name+": Resigned leadership successfully", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

was this log text altered by mistake?

Contributor Author

Yep, thx. Fixed.

Comment thread internal/leadership/elector.go Outdated
}

e.Logger.DebugContext(ctx, e.Name+": Current leader attempting reelect", "client_id", e.config.ClientID)
e.Logger.InfoContext(ctx, e.Name+": Current leader received forced resignation", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

the text here was also changed, I think accidentally

Contributor Author

Yep, thx. Fixed.

Comment thread internal/leadership/elector.go Outdated
// another client, even in the event of a cancellation.
func (e *Elector) attemptResignLoop(ctx context.Context) {
e.Logger.DebugContext(ctx, e.Name+": Attempting to resign leadership", "client_id", e.config.ClientID)
e.Logger.InfoContext(ctx, e.Name+": Current leader received forced resignation", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

I don't think this text is necessarily true, there are other reasons this could be called besides a forced resignation.

Contributor Author

Yeah, I think it's wrong. Reverted.

Comment on lines 93 to +107
 -- name: JobDeleteBefore :execresult
 DELETE FROM /* TEMPLATE: schema */river_job
-WHERE
-    id IN (
-        SELECT id
-        FROM /* TEMPLATE: schema */river_job
-        WHERE
+WHERE id IN (
+    SELECT id
+    FROM /* TEMPLATE: schema */river_job
+    WHERE (
         (state = 'cancelled' AND finalized_at < cast(@cancelled_finalized_at_horizon AS text)) OR
         (state = 'completed' AND finalized_at < cast(@completed_finalized_at_horizon AS text)) OR
         (state = 'discarded' AND finalized_at < cast(@discarded_finalized_at_horizon AS text))
-        ORDER BY id
-        LIMIT @max
-    )
-    -- This is really awful, but unless the `sqlc.slice` appears as the very
-    -- last parameter in the query things will fail if it includes more than one
-    -- element. The sqlc SQLite driver uses position-based placeholders (?1) for
-    -- most parameters, but unnamed ones with `sqlc.slice` (?), and when
-    -- positional parameters follow unnamed parameters great confusion is the
-    -- result. Making sure `sqlc.slice` is last is the only workaround I could
-    -- find, but it stops working if there are multiple clauses that need a
-    -- positional placeholder plus `sqlc.slice` like this one (the Postgres
-    -- driver supports a `queues_included` parameter that I couldn't support
-    -- here). The non-workaround version is (unfortunately) to never, ever use
-    -- the sqlc driver for SQLite -- it's not a little buggy, it's off the
-    -- charts buggy, and there's little interest from the maintainers in fixing
-    -- any of it. We already started using it though, so plough on.
-    AND (
-        cast(@queues_excluded_empty AS boolean)
-        OR river_job.queue NOT IN (sqlc.slice('queues_excluded'))
-    );
+    )
+    AND (/* TEMPLATE_BEGIN: queues_excluded_clause */ true /* TEMPLATE_END */)
+    AND (/* TEMPLATE_BEGIN: queues_included_clause */ true /* TEMPLATE_END */)
+    ORDER BY id
+    LIMIT @max
+);
Contributor

This comment applies to all drivers but I'm leaving it here because it doesn't seem like the pgx driver's JobDeleteBefore was updated in this PR yet. This query is currently targeted at the following index:

"river_job_state_and_finalized_at_index" btree (state, finalized_at) WHERE finalized_at IS NOT NULL

That will break if we're also filtering by queue, and this may turn into a sequential scan.

I haven't checked the other queries you've modified here, but this is an important consideration we'd need to be very careful of for all of them.

Contributor Author

@bgentry I ended up having the LLM run some minor benchmarking and checking index use. Here's what it produced:

  When QueuesIncluded is NULL (the default, no LeaderDomain set): The query uses
  river_job_state_and_finalized_at_index as a single index scan, exactly as before. No change in behavior.

  When QueuesIncluded is a concrete array (non-default LeaderDomain): The planner switches to a BitmapOr
  strategy with three legs:
  - cancelled leg: uses river_job_state_and_finalized_at_index on (state, finalized_at)
  - completed leg: uses river_job_prioritized_fetching_index on (state, queue)
  - discarded leg: uses river_job_state_and_finalized_at_index on (state, finalized_at)

  Then finalized_at and queue are applied as heap filters.

  This is actually reasonable planner behavior. For the completed leg, when filtering by specific queues, the
  planner correctly identifies that (state, queue) from the fetching index is more selective than (state,
  finalized_at) — it narrows to both the state and the small set of queues before checking finalized_at. The
  total estimated cost (21.76) is lower than the NULL-case cost (194.13).

Maybe read that over to make sure you're satisfied, but I think even with the added queues, we should still be good here.

Comment on lines +2 to +4
-- Alter `river_leader` to remove check constraint that `name` must be
-- `default`. SQLite doesn't allow schema modifications, so this redefines the
-- table entirely.
Contributor

SQLite doesn't allow schema modifications, so this redefines the table entirely.

🤯

Contributor Author

Yeah :( Awful.
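For readers unfamiliar with it, the SQLite "redefine the table entirely" dance generally follows the pattern below. This is a generic sketch of the technique only, not the PR's actual migration, and river_leader's real column list may differ:

```sql
-- Hypothetical sketch: SQLite can't ALTER a CHECK constraint away, so the
-- table is rebuilt without it and the rows are copied across.
CREATE TABLE river_leader_new (
    name       TEXT PRIMARY KEY,  -- CHECK (name = 'default') removed
    leader_id  TEXT NOT NULL,
    elected_at TEXT NOT NULL,
    expires_at TEXT NOT NULL
);
INSERT INTO river_leader_new (name, leader_id, elected_at, expires_at)
    SELECT name, leader_id, elected_at, expires_at FROM river_leader;
DROP TABLE river_leader;
ALTER TABLE river_leader_new RENAME TO river_leader;
```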

updatedSQL := sql
updatedSQL = replaceTemplate(updatedSQL, templateBeginEndRE)
updatedSQL = replaceTemplate(updatedSQL, templateRE)
updatedSQL = replaceTemplate(updatedSQL, templateBeginEndRE)
Contributor

What's going on with the reordering here?

Contributor Author

Just accidental again :/ Reverted.

Comment thread client.go
Comment on lines +243 to +245
// will continue to run on all leaders regardless of domain. If using this
// feature, it's a good idea to configure ReindexerTimeout on all but a
// single leader domain to river.NeverSchedule().

Now that we are running many pods of the same worker, do we treat a single pod as the "leader" here, or a group of workers? If we're saying that only one pod should be reindexing, what do we do when that pod dies? Or would having, say, 3 reindexer pods and many more that don't reindex also be alright?

@brandur brandur force-pushed the brandur-leadership-realms branch 2 times, most recently from 3bc1c32 to 1f12a24 Compare May 1, 2026 16:41
@brandur
Contributor Author

brandur commented May 1, 2026

Okay, I rebased this one to make the proposal a bit more current again in pursuit of coming up with a potential solution for #1105.

I had my LLM try to work through some alternative solutions with me, and though it came up with a variety of alternatives, it didn't come up with anything substantially better. It would be better if the client could pick a value for leader domain based on its configured queues, but as we discussed yesterday, this doesn't work well operationally in case a queue is added to or removed from a domain.

It did suggest renaming LeaderDomain to a less abstract alternative like ClientGroup, and I think that's worth considering, but even that I'm not sure is substantially better. ("Client group" is okay in that it suggests that clients might be grouped together, but I kind of like that "leader domain" suggests concretely that this is related to leadership elections specifically. Maybe something like "leader group" would be better though.)

@brandur brandur force-pushed the brandur-leadership-realms branch from 1f12a24 to be4d567 Compare May 1, 2026 18:12
@brandur brandur force-pushed the brandur-leadership-realms branch from be4d567 to a5c8ae8 Compare May 1, 2026 18:13